LDA assumes that the data points of each category are normally distributed.
Once the data are projected, LDA identifies a second line that separates the groups.
This separator can be used as a classifier or as a dimension reduction technique.
The same separator extends to data with more variables, whatever the number of dimensions (see below).
# load MASS for lda() and ggplot2 for the plot
library(MASS)
library(ggplot2)
# fit the LDA model and retrieve the estimate for B
lda.obj = lda(type ~ ., data = reduced_bc)
Bhat = lda.obj$scaling
# Find the scores - the original observations in terms of our LD directions
lda.scores = as.matrix(reduced_bc[,-1])%*%lda.obj$scaling
# Only use the first two scores for our 2D plot
lda.scores.forplot = as.matrix(reduced_bc[,-1])%*%Bhat[,1:2]
# Hence, we will use this data to make our plot
data.forplot = data.frame(lda.scores.forplot, Type = reduced_bc$type)
# plot the scores and color them appropriately!
ggplot(data.forplot) +
geom_point(aes(x = LD1, y = LD2, color = Type), size = 2) +
ggtitle("Method: LDA") +
theme_bw()
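The two uses of LDA mentioned above, classification and dimension reduction, can be sketched on a small example. This is a minimal illustration, not the analysis from this document: it assumes the built-in `iris` data as a stand-in for `reduced_bc`, and the names `fit`, `pred`, `train_accuracy`, and `scores` are introduced here only for the example.

```r
library(MASS)  # provides lda()

# iris is a stand-in data set (assumption): 3 classes, 4 numeric predictors
fit <- lda(Species ~ ., data = iris)

# Use 1: classifier. predict() returns a predicted class for every row
pred <- predict(fit, iris)
train_accuracy <- mean(pred$class == iris$Species)

# Use 2: dimension reduction. pred$x holds the discriminant scores,
# with at most (number of classes - 1) columns
scores <- pred$x
dim(scores)
```

With three classes there are at most two discriminant directions, which is why a 2D plot of the scores loses nothing in this case.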
LDA works best when the classes can be separated linearly, the predictors are approximately normally distributed within each class, and the response is categorical.
The goal of LDA is to separate the data points of the different categories as clearly as possible in the lower-dimensional space.
However, it is not as effective as t-SNE and UMAP when the number of dimensions (variables) is very high.
t-SNE starts by measuring the similarity between data points in both the high-dimensional space and the low-dimensional space.
In the high-dimensional space the similarities are modeled with a normal distribution, while in the low-dimensional space they are modeled with a t-distribution.
Lastly, t-SNE minimizes the Kullback-Leibler divergence between the two sets of similarities, so that the low-dimensional similarities match the high-dimensional ones as closely as possible.
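For reference, the quantities described above can be written out as follows. The notation is introduced here and follows the standard formulation of t-SNE: $x_i$ are the high-dimensional points, $y_i$ their low-dimensional counterparts, $n$ the number of points, and $\sigma_i$ a per-point bandwidth that is set from the perplexity.

```latex
\begin{align*}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\
\mathrm{KL}(P \,\|\, Q) &= \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{align*}
```

The first line is the normal (Gaussian) similarity in the high-dimensional space, the second is the t-distribution similarity in the low-dimensional space, and the third is the divergence that the embedding is optimized to minimize.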
Linked below is an interactive website that visualizes UMAP and compares it with t-SNE:
https://pair-code.github.io/understanding-umap/
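# load Rtsne for Rtsne() and scales for alpha()
library(Rtsne)
library(scales)
# fix the seed so the random embedding is reproducible (42 is an arbitrary choice)
set.seed(42)
# run the TSNE
tsne <- Rtsne(as.matrix(reduced_bc[, -1]),
  dims = 2, perplexity = 15, verbose = FALSE, max_iter = 5000)
# color and name the groups
colors <- rainbow(length(unique(reduced_bc$type)))
names(colors) <- unique(reduced_bc$type)
# make the plot
plot(tsne$Y, type = "n",
  main = "Method: tSNE", xlab = "tSNE Dimension 1", ylab = "tSNE Dimension 2",
  xlim = c(-15, 15), ylim = c(-20, 20), cex.main = 1.2, cex.lab = 1)
# look colors up by name so that a factor 'type' is not matched by its integer codes
points(tsne$Y[,1], tsne$Y[,2], col = alpha(colors[as.character(reduced_bc$type)], 0.8), cex = 0.9, pch = 19)
legend("topright", legend = unique(reduced_bc$type), col = colors,
  pch = 19, bty = "n", pt.cex = 1.2, cex = 0.6, text.col = colors, horiz = FALSE,
  inset = 0.01, y.intersp = 1.2)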
UMAP (Uniform Manifold Approximation and Projection) is an unsupervised, nonlinear dimensionality reduction technique.
It is well suited for embedding high-dimensional data in a low-dimensional space (2D or 3D) for data visualization.
This technique is typically used for exploration.
The idea is very similar to t-SNE; however, instead of measuring the similarity between all pairs of points, UMAP builds a graph in which only the similarities between neighboring points are needed.
It then creates a low-dimensional graph, which is optimized to look like the high-dimensional graph.
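To make the idea of a neighbor graph concrete, the sketch below builds a plain k-nearest-neighbor table in base R. It is only an illustration of the principle: UMAP itself uses its own neighbor search and assigns weights to the edges, which is not reproduced here. The built-in `iris` data stands in for `reduced_bc`, and `k` is a hypothetical value playing the role of the number of neighbors.

```r
# iris is a stand-in data set (assumption); keep only the numeric columns
x <- as.matrix(iris[, 1:4])
k <- 5  # hypothetical number of neighbors per point

d <- as.matrix(dist(x))  # all pairwise Euclidean distances
diag(d) <- Inf           # a point is not its own neighbor

# for each point, keep only the indices of its k nearest neighbors
knn <- t(apply(d, 1, function(row) order(row)[1:k]))

dim(knn)                 # one row per point, k columns
nrow(x) * (nrow(x) - 1)  # ordered pairs an all-pairs method would consider
length(knn)              # ordered pairs kept in the neighbor graph
```

Each row of `knn` lists the neighbors of one point, so the graph stores `k` entries per point instead of one entry for every other point.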
Linked below is an interactive UMAP website that gives more in-depth explanations of the tuning parameters:
https://pair-code.github.io/understanding-umap/
# load the umap package for umap()
library(umap)
# split the label and data separately
reduce_bc_label = reduced_bc$type
reduce_bc_data = reduced_bc[,-1]
# run the umap function
reduce_bc_umap = umap(reduce_bc_data)
# UMAP plot written as a function
plot.reduce.bc = function(x, labels,
main="Method:UMAP",
colors=c("#ff7f00", "#e377c2", "#17becf"),
pad=0.1, cex=0.6, pch=19, add=FALSE, legend.suffix="",
cex.main=1, cex.legend=0.85){
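  # accept either a umap object or a plain matrix of coordinates
  layout = x
  if (inherits(x, "umap")) {
    layout = x$layout
  }
  # use the same padded range on both axes
  xylim = range(layout)
  xylim = xylim + ((xylim[2]-xylim[1])*pad)*c(-0.5, 0.5)
  # start a new empty plot unless points are added to an existing one
  if (!add) {
    par(mar=c(0.2,0.7,1.2,0.7), ps=10)
    plot(xylim, xylim, type="n", axes=FALSE, frame.plot=FALSE)
    rect(xylim[1], xylim[1], xylim[2], xylim[2], border="#aaaaaa", lwd=0.25)
  }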
points(layout[,1], layout[,2], col=as.integer(as.factor(labels)),
cex=cex, pch=pch)
mtext(side=3, main, cex=cex.main)
labels.u = unique(as.factor(labels))
legend.pos = "topright"
legend.text = as.character(labels.u)
if (add) {
legend.pos = "topright"
legend.text = paste(as.character(labels.u), legend.suffix)
}
legend(legend.pos, legend=legend.text, inset=0.03,
col=as.integer(labels.u),
bty="n", pch=pch, cex=cex.legend)
}
# run the umap plot
plot.reduce.bc(reduce_bc_umap,reduce_bc_label)
UMAP can be seen as a faster alternative to t-SNE: in general, any situation where t-SNE can be used is also suitable for UMAP.
This method is less computationally expensive because it only considers the "neighbors" of each point when calculating the similarity measure and creates a graph accordingly.
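How many neighbors are considered for each point is a tuning parameter. The sketch below shows one way it could be changed with the `umap` package, by copying `umap.defaults` and editing its fields. The built-in `iris` data stands in for `reduced_bc`, and the values 10, 0.3, and 123 are arbitrary choices made for illustration, not recommendations.

```r
library(umap)  # provides umap() and umap.defaults

# iris is a stand-in data set (assumption); keep only the numeric columns
x <- as.matrix(iris[, 1:4])

# copy the default settings and change a few of them
custom.config <- umap.defaults
custom.config$n_neighbors <- 10    # neighbors considered for each point
custom.config$min_dist <- 0.3      # how tightly points may be packed in the embedding
custom.config$random_state <- 123  # fixed seed for a reproducible layout

emb <- umap(x, config = custom.config)
dim(emb$layout)  # one row per observation, one column per embedding dimension
```

Smaller values of `n_neighbors` emphasize local structure, while larger values give more weight to the global layout, so it is worth comparing a few settings during exploration.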